A Tutorial on Automated Text Categorisation
Abstract
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late ’80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by “learning”, from a set of previously classified documents, the characteristics of one or more categories; the advantages are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation will be touched upon.

1 A definition of the text categorisation task

Document categorisation (or classification) may be seen as the task of determining an assignment of a value from $\{0,1\}$ to each entry of the decision matrix

\[
\begin{array}{c|ccccc}
       & d_1    & \cdots & d_j    & \cdots & d_n    \\ \hline
c_1    & a_{11} & \cdots & a_{1j} & \cdots & a_{1n} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
c_i    & a_{i1} & \cdots & a_{ij} & \cdots & a_{in} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
c_m    & a_{m1} & \cdots & a_{mj} & \cdots & a_{mn}
\end{array}
\]

where $C = \{c_1, \ldots, c_m\}$ is a set of pre-defined categories, and $D = \{d_1, \ldots, d_n\}$ is a set of documents to be categorised (sometimes called “requests”). A value of 1 for $a_{ij}$ is interpreted as a decision to file $d_j$ under $c_i$, while a value of 0 is interpreted as a decision not to file $d_j$ under $c_i$.

Fundamental to the understanding of this task are two observations:

• the categories are just symbolic labels. No additional knowledge of their “meaning” is available to help in the process of building the categoriser; in particular, this means that the “text” constituting the label (e.g. Sports in a news categorisation task) cannot be used;
• the attribution of documents to categories should, in general, be made on the basis of the content of the documents, and not on the basis of metadata (e.g. publication date,
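As an illustration of the decision matrix defined in this section, the following minimal Python sketch fills an m × n matrix of 0/1 values, one row per category and one column per document. The function name `decision_matrix`, the keyword table and the toy documents are all hypothetical and are not taken from the tutorial; the keyword rule merely stands in for whatever classifier produces the decisions. Categories are referred to by opaque identifiers only, in keeping with the observation that the text of a label cannot be used.

```python
from typing import Callable, Dict, List, Sequence, Set


def decision_matrix(
    categories: Sequence[str],
    documents: Sequence[str],
    decide: Callable[[str, str], bool],
) -> List[List[int]]:
    """Return a[i][j] = 1 if documents[j] is filed under categories[i], else 0."""
    return [[1 if decide(c, d) else 0 for d in documents] for c in categories]


# Hypothetical toy data: category identifiers are opaque symbols, and the
# decision depends only on document content, never on the label text.
KEYWORDS: Dict[str, Set[str]] = {
    "c1": {"goal", "match"},
    "c2": {"election", "vote"},
}


def keyword_rule(category: str, document: str) -> bool:
    """Toy stand-in for a classifier: file a document if it contains a keyword."""
    return any(word in KEYWORDS[category] for word in document.split())


categories = ["c1", "c2"]
documents = [
    "the match ended with a late goal",
    "voters went to the polls for the election",
    "the weather was mild",
]

matrix = decision_matrix(categories, documents, keyword_rule)
for category, row in zip(categories, matrix):
    print(category, row)
```

Each row of `matrix` corresponds to one category and each column to one document, so `matrix[i][j]` plays the role of the entry a_ij in the definition.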
Related resources
A Classification of Dialogue Actions in Tutorial Dialogue
In this paper we present a taxonomy of dialogue moves which describe the actions that students and tutors perform in tutorial dialogue. We are motivated by the need for a categorisation of such actions in order to develop computational models for tutorial dialogue. As such, we build both on existing work on dialogue move categorisation for tutorial dialogue as well as dialogue taxonomies for ge...
Machine Learning in Automated Text Categorisation
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early ’60s. Until the late ’80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify do...
The Role of Automated Categorisation in e-Government Information Retrieval
High-precision search results are essential for supporting e-government employees’ information tasks. Prior studies have shown that existing features of e-government retrieval systems need improvement in terms of search facilities, navigation and metadata. This paper investigates how automated categorisation can enhance information organisation and retrieval and presents the results of a realis...
Unsupervised categorisation approaches for technical support automated agents
In this paper we describe an unsupervised approach for the automated categorisation of utterances into predefined categories of symptoms (or problems) within the framework of a technical support automated agent. The utterance classification is performed based on an iterative K-means clustering method. In order to improve the lower accuracy typical of unsupervised algorithms, we have analysed tw...